Despite improvements in sequencing methods, no technique is error-free. Measuring sequencing quality correctly is essential for identifying problems in a run, so it must be the first step of every sequencing analysis. Once quality control is finished, low-quality reads and reads that are too short have to be removed, which makes a trimming step mandatory. After trimming, a second quality control step is recommended to confirm that the trimming worked.
| Title | Pre-processing |
|---|---|
| Training dataset: | PRJEB43037 - In August 2020, an outbreak of West Nile Virus affected 71 people with meningoencephalitis in Andalusia and 6 more cases in Extremadura (south-west of Spain), causing a total of eight deaths. The virus belonged to lineage 1 and was relatively similar to those from previous outbreaks that occurred in the Mediterranean region. Here, we present a detailed analysis of the outbreak, including an extensive phylogenetic study. This is one of the outbreak samples. |
| Questions: | |
| Objectives: | |
| Estimated time: | 25 min |
To run the quality control over the samples, follow these steps:

1. Create a new history named Illumina preprocessing, as explained yesterday.
2. Upload the data as seen yesterday by copying and pasting the following URLs:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR531/002/ERR5310322/ERR5310322_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR531/002/ERR5310322/ERR5310322_2.fastq.gz
To see the results, open the jobs with Web page in their name for both data 1 and data 2.
Here you can see the number of reads in each file, the maximum and minimum read length in the sample, and the quality plots for both R1 and R2. The quality looks quite good, but we are going to trim the samples anyway.
How many reads do the samples have?
265989
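The read count reported by FastQC can also be cross-checked by hand, since each FASTQ record takes four lines. The following is a minimal sketch, assuming unwrapped four-line records; the function names are hypothetical and the toy input is not the tutorial data.

```python
import gzip
import io


def count_fastq_reads(handle):
    """Count reads in an open FASTQ text handle (assumes 4 lines per record)."""
    n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError("FASTQ line count is not a multiple of 4")
    return n_lines // 4


def count_reads_in_gz(path):
    """Count reads in a gzip-compressed FASTQ file such as a *.fastq.gz download."""
    with gzip.open(path, "rt") as handle:
        return count_fastq_reads(handle)


# Toy in-memory example with two reads (illustrative only).
toy = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")
toy_count = count_fastq_reads(toy)
print(toy_count)
```

For a downloaded file, `count_reads_in_gz("ERR5310322_1.fastq.gz")` would apply the same count to the compressed data, assuming the file sits in the working directory.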
First question
How do I check whether my Illumina data was correctly sequenced?
Using FastQC
Once we have performed the quality control, we have to perform the quality and read length trimming:
To see the trimming statistics, have a look at the fastp on data 2 and data 1: HTML report file. You should see something like this.
How many reads have we lost?
98664 reads
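To make the idea of quality and read length trimming concrete, here is a minimal sketch on toy reads. It is not the algorithm used by fastp or Trimmomatic, which implement more elaborate strategies. It only removes low-quality bases from the 3' end and then discards reads that end up too short. The thresholds (`min_quality=20`, `min_length=5`) and all function names are assumptions chosen for illustration, and qualities are assumed to be Phred+33 encoded.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def trim_3prime(seq, qual, min_quality=20):
    """Drop bases from the 3' end while their quality is below min_quality."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return seq[:end], qual[:end]


def trim_reads(reads, min_quality=20, min_length=5):
    """Quality-trim each read, then keep only reads of at least min_length."""
    kept = []
    for name, seq, qual in reads:
        seq, qual = trim_3prime(seq, qual, min_quality)
        if len(seq) >= min_length:
            kept.append((name, seq, qual))
    return kept


toy_reads = [
    ("r1", "ACGTACGTAC", "IIIIIIIIII"),  # all bases Q40: kept unchanged
    ("r2", "ACGTACGTAC", "IIIIII####"),  # last 4 bases Q2: trimmed to 6 bases
    ("r3", "ACGTACGTAC", "II########"),  # only 2 good bases: discarded
]
kept = trim_reads(toy_reads)
lost = len(toy_reads) - len(kept)
print(f"kept {len(kept)} reads, lost {lost}")
```

The number of lost reads is simply the difference between the read counts before and after trimming, which is the same comparison made with the reports above.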
Trimmomatic does not compute statistics on the trimmed reads, so we need to run FastQC again on the Trimmomatic results.
Try to do it on your own.
Second question
How can I improve the quality of my data?
Using trimming software, such as fastp or Trimmomatic.
| Title | Galaxy |
|---|---|
| Training dataset: | The data we are going to work with are Nanopore amplicon sequencing data generated with ARTIC network primers for the SARS-CoV-2 genome. From the Fast5 files generated by the ONT software, we are going to select the pass reads, so they are already filtered by quality. |
| Questions: | |
| Objectives: | |
| Estimated time: | 15 min |
To run the quality control over the samples, follow these steps:

1. Create a new history named Nanopore quality, as explained yesterday.
2. Upload the data as seen yesterday by copying and pasting the following URLs:
https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/nanopore/minion/fastq_pass/barcode01/FAO93606_pass_barcode01_7650855b_0.fastq
https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/nanopore/minion/fastq_pass/barcode01/FAO93606_pass_barcode01_7650855b_1.fastq
https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/nanopore/minion/fastq_pass/barcode01/FAO93606_pass_barcode01_7650855b_2.fastq
Now we are going to have a look at the results.
As you can see, the Mean read length is around 500 nt, which makes sense because we are using amplicon sequencing data.
How many reads do the samples have?
3K reads
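The mean read length mentioned above is just the average length of the sequence lines in the FASTQ files. A minimal sketch of that calculation follows, assuming unwrapped four-line FASTQ records. The function name and the toy lengths are hypothetical and do not come from the tutorial data.

```python
from statistics import mean, median


def read_lengths(lines):
    """Return the length of every sequence line (2nd line of each 4-line record)."""
    return [len(line.strip()) for i, line in enumerate(lines) if i % 4 == 1]


# Toy records with lengths 480, 500 and 520 (illustrative only).
toy_fastq = []
for name, length in [("a", 480), ("b", 500), ("c", 520)]:
    toy_fastq += [f"@{name}", "A" * length, "+", "I" * length]

lengths = read_lengths(toy_fastq)
print(len(lengths), mean(lengths), median(lengths))
```

For a real file, the same function could be applied to the open file handle, for example `read_lengths(open(path))` with `path` pointing to one of the uploaded FASTQ files.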
First question
How do I check whether my Nanopore data was correctly sequenced?
Using NanoPlot and having a look at the mean read length.
While Nanopore reads are being sequenced, the MinKNOW software splits the Fast5 reads into quality pass and quality fail. Since we select only the Fast5 pass reads, we do not need to perform quality trimming: even if the reads show a poor Phred score, we know that the ONT software considered them "good quality".
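As a reminder of what a Phred score means, a score Q corresponds to a base-calling error probability p through Q = -10 · log10(p). The sketch below shows the conversion in both directions. The function names are hypothetical, and the Phred+33 offset used to decode a quality character is an assumption about the file encoding.

```python
import math


def error_probability(phred):
    """Convert a Phred quality score into the probability of a wrong base call."""
    return 10 ** (-phred / 10)


def phred_from_probability(probability):
    """Convert an error probability back into a Phred quality score."""
    return -10 * math.log10(probability)


def phred_from_char(char, offset=33):
    """Decode a single FASTQ quality character (Phred+33 assumed)."""
    return ord(char) - offset


for q in (10, 20, 30):
    print(f"Q{q}: error probability {error_probability(q):.3f}")
```

Under this definition, Q10 means roughly one error in 10 bases and Q20 one error in 100, which is why Nanopore reads can look poor by Illumina standards and still be usable.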
We will therefore only perform read length trimming. As we are using amplicon sequencing data, we do not expect reads shorter than 400 nucleotides or longer than 600; reads above that length would most likely be chimeric.
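The read length filter described here can be sketched in a few lines. This is only an illustration of the 400 to 600 nucleotide window, not the tool run in Galaxy. The function names are hypothetical, and the input is assumed to be unwrapped four-line FASTQ records.

```python
def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from 4-line FASTQ records."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, _plus, qual = record
            yield header, seq, qual
            record = []


def filter_by_length(records, min_len=400, max_len=600):
    """Keep only reads whose length lies within [min_len, max_len]."""
    for header, seq, qual in records:
        if min_len <= len(seq) <= max_len:
            yield header, seq, qual


# Toy reads: one too short, one amplicon-sized, one long (possibly chimeric).
toy_lines = []
for name, length in [("short", 150), ("amplicon", 500), ("chimera", 950)]:
    toy_lines += [f"@{name}", "A" * length, "+", "I" * length]

kept_names = [header[1:] for header, _seq, _qual in filter_by_length(parse_fastq(toy_lines))]
print(kept_names)
```

Both bounds are inclusive in this sketch, so reads of exactly 400 or 600 nucleotides are kept.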
We will come across one error in this job:
This happens because the software to filter SARS-CoV-2 amplicon data is not properly installed on the Galaxy server, which is a fairly common problem in Galaxy.